ABSTRACT

Analysis of Covariance (ANCOVA) is a data analysis method that is often 
employed to control extraneous sources of variation in non-equivalent group 
designs. It is commonly believed that so long as the covariate is highly correlated 
with the dependent variable there is nothing to lose in employing ANCOVA, even 
in non-randomized studies. This paper examines some of the conditions that lead 
to successful and unsuccessful criterion score adjustments, and demonstrates that 
under certain circumstances, ANCOVA may perform in a manner antithetical to its 
intended purpose. 

CONFOUNDING COVARIATES IN NONRANDOMIZED STUDIES

INTRODUCTION 

The analysis of covariance (ANCOVA), as employed in educational research practice, is 
routinely used for one or both of two purposes. The first of these purposes is to attain an in- 
crease in the power of a statistical test. As an example, a researcher might randomly assign stu- 
dents to various treatment groups with subsequent outcome measures being evaluated by means 
of an analysis of variance (ANOVA). In the event that ancillary information pertaining to the 
students is available in the form of measures that (a) correlate with the outcome measures and 
(b) do not reflect treatment effects, then the power of the ANOVA test may be augmented by 
the introduction of a covariate, similar to the use of "blocks" in different design contexts. Under 
random assignment covariate scores for students in various treatment groups are sampled from 
identical populations. 

Covariates are also commonly employed to adjust criterion measures so as to ameliorate
group differences that are unrelated to treatments. That is, the second use of ANCOVA is to
control an extraneous variable. For example, an educational researcher may be unable to
randomly assign students to treatments and subsequently become aware of differences between the
groups in terms of intellectual ability. In the event that outcome measures are related to
intellectual ability (e.g., scores on a reading test), then the researcher might employ IQ scores as a
covariate in the model in order to control for group differences in intellectual ability. Unlike
the first use of ANCOVA, in this instance groups do differ on the covariate measure and it is
precisely because of this difference that the covariate is used. For a further discussion on this
topic and related issues, see Cochran (1957, 1970), Elashoff (1969), Evans and Anastasio (1968),
Fisher (1932), Harris (1963), Levin and Subkoviak (1977), Linn (1981), and Lord (1960).

The use of covariates is not a substitute for random assignment of experimental units,
but its proliferation is apparently promoted by the belief that covariates in non-equivalent
group designs can only improve the level of precision in the data analysis. The argument
contends that, under nonrandom assignment, ANCOVA will at best control at least some of the sources
of extraneous variation and will at worst yield results no more biased than those of a traditional
ANOVA. This in turn can only lead to greater confidence in the validity of results than would have
been realized had the covariate not been employed in the model. While it has been clearly pointed out
that "ANCOVA provides the appropriate adjustment only under a very limited set of
conditions" (Porter and Raudenbush, 1987, p. 390), and randomization is a primary condition, the
prevalence of its use in nonrandomized studies in education suggests that it is not commonly
known that under certain circumstances the use of ANCOVA will introduce extraneous
influences into the analysis. Not only does ANCOVA fail to provide precision in
this situation, but it operates in a manner antithetical to its purpose.

PURPOSE OF THIS PAPER 
The purpose of this paper is to explicate and focus attention on the problem of
unsuccessful adjustments made in data analysis through misuse of covariates. It will be helpful
first to review some basic assumptions of psychometric test score theory, which form the
basis for the subsequent discussion. Examples of successful and unsuccessful criterion
score adjustments are then presented, both in situations where the null hypothesis of no
treatment effect is true and in situations where it is false.

CLASSICAL TEST SCORE THEORY 
Classical test score theory conceives of the raw score earned by a student on a given test 
as a function of two basic components, expressed as: 

$$x_i = t_i + e_i \tag{1}$$

Because of its error component ($e_i$), $x_i$ is often referred to as the observed or fallible score
earned by the ith student on the test. In contrast, $t_i$ refers to the ith student's true score and
represents the student's actual ability in the subject matter. If the student's ability does not change,
then this quantity will remain stable from test to test so long as the tests measure the same
subject matter on the same scale. The error component represents those random influences that
inflate or deflate a test score, but are unrelated to the true score. The $e_i$ are generally taken to
have a mean of zero (Gulliksen, 1950).

There are other elements of test scores that fit neither of the above categories. These
elements do not represent the student's ability in the subject matter, and yet are stable rather than
random elements. Classical measurement theory expressed in (1) can therefore be expanded to:

$$x_i = t_i + c_{1i} + c_{2i} + \cdots + c_{ki} + e_i \tag{2}$$

where $c_{1i}, \ldots, c_{ki}$ represent the various components of this type that contribute to the
fallible score of the ith student. (Some simplification of the theory has been made here.) A better
understanding of a possible component $c_j$ can be gained by briefly examining a common example called
"test-wiseness".

Test-wiseness has been defined as a student's ability to use the characteristics or format
of a test to obtain a higher score (Millman, Bishop, and Ebel, 1965). It is independent of the
student's knowledge of the subject matter. Test-wiseness can be seen in students who have had
extensive experience with a particular type of test. For example, students taking multiple-choice
tests soon learn to look for clues that will allow them to eliminate otherwise attractive foils. In
this situation a test score reflects not only a student's level of knowledge of the subject matter,
but test-taking skills as well. Other examples of score components that are unrelated to the
intended object of a particular measurement process are discussed by Bajtelsmit (1975), Diamond
and Evans (1972), Hall, Follman, and Fisher (1987), Kirkland and Hollandsworth (1980),
Millman (1966), Rowley (1974), Sarnacki (1979), Stanley (1971), and Wigdor and Garner (1982).
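
The decomposition in (2) can be made concrete with a small simulation. The Python sketch below is illustrative only: the function name `observed_scores`, its parameters, the sample size, and the restriction to a single stable component (k = 1, for instance test-wiseness) are assumptions introduced here rather than part of the original analysis.

```python
import random
from statistics import mean

def observed_scores(n, t_mean, c_mean, rng, sd=1.0):
    """Draw n fallible scores x_i = t_i + c_i + e_i as in model (2), with k = 1.

    t_i is the true score, c_i a stable component unrelated to ability
    (e.g., test-wiseness), and e_i random error with mean zero.
    """
    scores = []
    for _ in range(n):
        t = rng.gauss(t_mean, sd)   # true ability
        c = rng.gauss(c_mean, sd)   # stable but irrelevant component
        e = rng.gauss(0.0, sd)      # random error
        scores.append(t + c + e)
    return scores

rng = random.Random(0)  # arbitrary seed
plain = observed_scores(5000, t_mean=0.0, c_mean=0.0, rng=rng)
wise = observed_scores(5000, t_mean=0.0, c_mean=1.0, rng=rng)
# Both groups have the same mean ability; any gap between the observed
# means comes from the stable component alone.
print(round(mean(plain), 2), round(mean(wise), 2))
```

The two groups have identical true-score distributions, so a difference in observed means here reflects the extra component and not ability.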

EXAMPLES OF COVARIATE ADJUSTMENTS 
The following examples of a single-factor ANCOVA are based on the linear model

$$y_{ij} = \mu + \alpha_j + \beta (x_{ij} - \bar{x}) + e_{ij} \tag{3}$$

$$i = 1, \ldots, n; \quad j = 1, \ldots, C$$

where $\mu$ is the grand mean, $\alpha_j$ is the effect due to the treatment, $\beta$ is the regression of y on x,
and $e_{ij}$ refers to the error component.
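
For two groups and one covariate, the quantities in model (3) have simple closed forms. The Python sketch below computes the pooled within-group slope, the covariate-adjusted difference between group means, and its t statistic. The function name `ancova_two_groups` and the toy data are hypothetical and serve only to illustrate the arithmetic; the original analyses were carried out with regression models.

```python
from math import sqrt
from statistics import mean

def ancova_two_groups(x_a, y_a, x_b, y_b):
    """Two-group ANCOVA with one covariate and a common (pooled) slope.

    Returns the pooled within-group slope, the adjusted difference
    (group a minus group b), its t statistic, and the error df.
    """
    def sums(x, y):
        mx, my = mean(x), mean(y)
        sxx = sum((u - mx) ** 2 for u in x)
        sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
        syy = sum((v - my) ** 2 for v in y)
        return mx, my, sxx, sxy, syy

    mxa, mya, sxxa, sxya, syya = sums(x_a, y_a)
    mxb, myb, sxxb, sxyb, syyb = sums(x_b, y_b)
    sxx, sxy, syy = sxxa + sxxb, sxya + sxyb, syya + syyb

    slope = sxy / sxx                               # estimate of beta in (3)
    adj_diff = (mya - myb) - slope * (mxa - mxb)    # adjusted group difference
    df = len(x_a) + len(x_b) - 3
    mse = (syy - slope * sxy) / df                  # residual mean square
    se = sqrt(mse * (1 / len(x_a) + 1 / len(x_b) + (mxa - mxb) ** 2 / sxx))
    return slope, adj_diff, adj_diff / se, df

slope, adj_diff, t_stat, df = ancova_two_groups(
    [0, 1, 2], [1, 2, 4], [0, 1, 2], [0, 2, 2])
print(slope, round(adj_diff, 3), round(t_stat, 3), df)
```

The adjusted difference is the raw difference in outcome means minus the slope times the difference in covariate means, so any group difference on the covariate, whatever its source, is carried into the adjustment.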

Example 1: Successful Adjustment When $H_0$ Is True

Suppose that an educational researcher has designed a study to determine which of two
methods of teaching arithmetic skills produces better results as measured by scores on a
standardized arithmetic test. Practical considerations require the researcher to use two previously formed
classes as a control (Class A) and experimental group (Class B). A pretest-posttest design is used
with an intervening period of instruction being provided to the two classes. Suppose further that
because of the manner in which the two classes were originally formed, Class A has a higher
mean arithmetic ability than does Class B. Noting this, the researcher decides to test for
treatment effects through the use of an ANCOVA model in which the pretest arithmetic test score is
to be used as the covariate. The attempt here is to control for the initial difference in arithmetic
ability between the two classes.

Several hypothetical data sets were constructed with characteristics shown in Table 1 to
illustrate this as well as the following analyses. The information in this table indicates that the
dependent and covariate measures for the two classes are made up of true and error score
components only. The table also shows that true scores for Class A were sampled from a population
with a mean of one, while those for Class B came from a population with a mean of zero. The
column headed c (for other components) reflects the fact that the scores contained no other
components. The last column shows that no treatment effect (in the form of a constant to produce a
shift in location parameter) was added to the scores of either group. The $e_{1i}$ and $e_{2i}$ for all
examples were sampled from distributions with a mean of zero and variance of one. All data sets
were composed of 70 observations (35 per class) with random variates being sampled from
normal populations with means and variances as shown in the table.
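
Data sets with the characteristics listed in Table 1 could be generated along the following lines. This is a sketch under stated assumptions, not the procedure actually used to build the original data: the function name `make_class`, its parameters, and the seed are all hypothetical.

```python
import random

def make_class(n, t_mean, rng, c_mean=None, tr=0.0, sd=1.0):
    """Simulate (dep, cov) scores for one class following Table 1.

    dep = t + tr + e1 and cov = t + e2; the covariate also receives a
    component c when c_mean is given (as in Examples 3 and 4).
    """
    dep, cov = [], []
    for _ in range(n):
        t = rng.gauss(t_mean, sd)                        # true score
        c = rng.gauss(c_mean, sd) if c_mean is not None else 0.0
        dep.append(t + tr + rng.gauss(0.0, sd))          # error e1
        cov.append(t + c + rng.gauss(0.0, sd))           # error e2
    return dep, cov

rng = random.Random(1990)  # arbitrary seed
# Example 1: true-score means of 1 (Class A) and 0 (Class B), no treatment.
dep_a, cov_a = make_class(35, t_mean=1.0, rng=rng)
dep_b, cov_b = make_class(35, t_mean=0.0, rng=rng)
# Example 2: as Example 1, with a treatment constant of 1.0 for Class B.
dep2_b, cov2_b = make_class(35, t_mean=0.0, rng=rng, tr=1.0)
print(len(dep_a), len(dep_b), len(dep2_b))
```

A single simulated data set of this size will not reproduce the p values reported in the examples, since those depend on the particular random variates drawn.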

In order to further simplify the discussion, Table 2 shows the various models based on (3)
that were used in this example and subsequent analyses. In keeping with common practice, least
squares regression models (with intercepts) were used for all analyses. In Table 2, dep and cov
represent, respectively, the dependent and covariate variables. In the models, grp is a dummy
vector (1 or 0) representing group membership, and int is the product of cov and grp.
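
The four models can be written as design matrices and fitted by ordinary least squares. The Python sketch below uses only the standard library; the helper names `solve`, `ols`, and `design`, and the noise-free toy data, are assumptions made for illustration. It returns coefficient estimates only and does not compute the significance tests reported in the examples.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def design(model, cov, grp):
    """Rows of the design matrix for models (a)-(d) of Table 2."""
    columns = {
        "a": lambda c, g: [1.0, c],            # intercept, cov
        "b": lambda c, g: [1.0, c, g],         # intercept, cov, grp
        "c": lambda c, g: [1.0, c, g, c * g],  # adds int = cov * grp
        "d": lambda c, g: [1.0, g],            # intercept, grp
    }[model]
    return [columns(c, g) for c, g in zip(cov, grp)]

cov = [0, 1, 2, 0, 1, 3]
grp = [1, 1, 1, 0, 0, 0]                          # dummy-coded group membership
dep = [2 + 3 * c + g for c, g in zip(cov, grp)]   # noise-free toy data
for name in "abcd":
    print(name, [round(b, 3) for b in ols(design(name, cov, grp), dep)])
```

In this toy data the group coefficient is 1 in model (b) but 0 in model (d), because the two groups differ on the covariate. The adjusted and unadjusted comparisons answer different questions whenever that is the case.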

Returning to the example, a test of $\beta_1$ in model (a) and $\beta_3$ in (c) produces (rounded) p
values of .000 and .999, respectively. This indicates a relationship between the covariate and
dependent variables and leaves as tenable the hypothesis of homogeneous regressions for the two
classes. Of primary interest is the test of $\beta_2$ in (b), which yields p = .682, thereby leaving
tenable the hypothesis of no treatment effect. Had the covariate measures not been available, a
test of $\beta_1$ in (d) (i.e., an independent samples t test) would have produced a significant result (p
= .000), thereby leading to the erroneous conclusion of a difference between the effectiveness of
instruction methods. In Example 1, therefore, the use of the covariate led to a correct
assessment of no treatment effect, while an analysis performed without the covariate led to the
opposite and erroneous conclusion.

Example 2: Successful Adjustment When $H_0$ Is False

The data used in this example are the same as those from Example 1 with the exception
that a treatment effect (modeled, with a constant of 1.0, as a shift in location parameter) was
added to the dependent variable scores of the members of Class B. Thus, the instructional
method used in Class B was superior to that used in Class A. The ANCOVA of these
data reaches the same conclusions as in Example 1 regarding the preliminary tests, but produces
p = .000 for the test of $\beta_2$ in (b). Thus, the ANCOVA correctly detected the presence
of a treatment effect. On the other hand, a test of $\beta_1$ in (d) (i.e., where the covariate is not
taken into account) results in a nonsignificant p value of .857, thereby failing to detect the
treatment effect.

In the above examples, the ANCOVA models performed appropriately. In the examples
that follow, however, results generated by these methods were inappropriate.

Example 3: Unsuccessful Adjustment When $H_0$ Is True

Suppose in the situation outlined in Example 1 that no pretest scores were available for
use as a covariate. Rather, scores from a different mathematics test (call this Test D) were
available. Unlike the test used to measure the outcome variable, Test D does not assess arithmetic
skills by simple presentation of calculational problems, but instead uses word problems as the
medium of presentation. Both tests measure arithmetic ability, but they do so through different
forms of problem presentation.

Even though tests of the form represented by Test D measure what they purport to
measure (i.e., arithmetic ability in this case), they also reflect a student's reading ability. (In
addition, there might be other components not shared by these tests, such as reasoning ability and
so forth.) This arises because problems must be read and understood before calculations can be
carried out. In this example it is assumed that the two classes have approximately the same
arithmetic skills, but Class A has higher mean reading ability than does Class B. Table 1 reflects this
by indicating that reading components (c) of the covariate scores for Class A were sampled from
a population with a mean of one and those for Class B had a population mean of zero. In this
example reading ability is taken to be independent of arithmetic skill and is extraneous to the
experiment.

The test of $\beta_1$ in (a) and $\beta_3$ in (c) results in p values of .000 and .949, respectively. The
test of $\beta_2$ in (b) yields a p value of .027, leading to the incorrect conclusion that a treatment
effect is present in the data. In this example, the correct conclusion of no treatment effect is
obtained by the test on $\beta_1$ in (d), which gives p = .857.
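
The tendency of the adjustment to manufacture a group difference in this setting can be checked by repeated simulation. The Python sketch below draws many data sets with the Example 3 characteristics from Table 1 and averages the unadjusted and covariate-adjusted differences (Class A minus Class B). The function names, the seed, and the number of replications are assumptions of this sketch; the original paper reports a single data set.

```python
import random
from statistics import mean

def adjusted_difference(x_a, y_a, x_b, y_b):
    """ANCOVA-adjusted mean difference (a minus b) using the pooled slope."""
    def centered(x, y):
        mx, my = mean(x), mean(y)
        sxx = sum((u - mx) ** 2 for u in x)
        sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
        return mx, my, sxx, sxy
    mxa, mya, sxxa, sxya = centered(x_a, y_a)
    mxb, myb, sxxb, sxyb = centered(x_b, y_b)
    slope = (sxya + sxyb) / (sxxa + sxxb)
    return (mya - myb) - slope * (mxa - mxb)

def example3_class(n, c_mean, rng):
    """dep = t + e1 and cov = t + c + e2, every component with unit variance."""
    dep, cov = [], []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)   # arithmetic ability, same mean in both classes
        dep.append(t + rng.gauss(0.0, 1.0))
        cov.append(t + rng.gauss(c_mean, 1.0) + rng.gauss(0.0, 1.0))
    return dep, cov

rng = random.Random(3)   # arbitrary seed
raw, adjusted = [], []
for _ in range(500):     # replication count chosen for illustration
    dep_a, cov_a = example3_class(35, 1.0, rng)   # Class A: higher reading ability
    dep_b, cov_b = example3_class(35, 0.0, rng)
    raw.append(mean(dep_a) - mean(dep_b))
    adjusted.append(adjusted_difference(cov_a, dep_a, cov_b, dep_b))

print(round(mean(raw), 2), round(mean(adjusted), 2))
```

Across replications the unadjusted difference averages close to zero, as it should when no treatment is applied, while the adjusted difference is centered away from zero.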

Unlike the first two examples, use of a covariate in this case led to an incorrect
conclusion, while analysis performed without the covariate resulted in a correct conclusion. The
explanation is straightforward: the covariate adjustment was made on the basis of the difference
in reading levels of the two classes. But reading level is unrelated to the dependent variable in
this example and is therefore irrelevant to the study. The important point here is that rather
than limiting or reducing the influence of some extraneous variable, the covariate acted to
introduce a confounding variable into the data analysis.
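
The size of the spurious adjustment can be worked out from the Table 1 settings for Example 3. The derivation below is a sketch that assumes population values, mutually independent components each with unit variance, and a common within-class slope. The symbols $\sigma_t^2$, $\sigma_c^2$, $\sigma_e^2$, and $\beta_w$ are introduced here for illustration; y denotes the dependent variable and x the covariate, as in (3).

```latex
\begin{align*}
\beta_w &= \frac{\operatorname{Cov}(y, x \mid \text{class})}{\operatorname{Var}(x \mid \text{class})}
         = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_c^2 + \sigma_e^2} = \frac{1}{3}, \\
E(\bar{y}_A - \bar{y}_B) &= 0, \qquad E(\bar{x}_A - \bar{x}_B) = 1, \\
E\left[(\bar{y}_A - \bar{y}_B) - \beta_w (\bar{x}_A - \bar{x}_B)\right]
        &= 0 - \tfrac{1}{3}(1) = -\tfrac{1}{3}.
\end{align*}
```

Under these assumptions the adjusted means differ by about one third of a point in favor of Class B although neither class received a treatment, whereas the unadjusted means do not differ in expectation.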

Example 4: Unsuccessful Adjustment When $H_0$ Is False

In this example a constant (treatment effect) of .5 was added to the dependent variable
scores of the students in Class B, with scores otherwise being the same as those used in Example 3.
Conclusions reached on the two preliminary tests are the same as those reached in the previous three
examples. The test of $\beta_2$ in (b) is nonsignificant with p = .814, while that of $\beta_1$ in (d) yields p
= .021. In this case, the model with a covariate failed to detect the presence of the treatment
effect. It was detected, however, when the covariate was removed from the analysis.

COMMENTS 

Rather than excluding the influence of confounding variables, ANCOVA may serve to 
introduce confounding variables into the analysis. This circumstance may occur when covariates 
reflect differences between groups that are unrelated to outcome measures. Educational 
researchers should be particularly aware of this problem when covariate and dependent variables 
are measured on different scales or when these measures are obtained under different sets of 
conditions. An example of the former situation occurs when one measure is obtained by observ- 
ing students at some task, but the other is obtained through administration of a test that assesses 
knowledge of how the task is performed. The latter situation may occur when, for example, one 
measure is timed but the other is not. In this case one measure reflects not only a student's 
ability to perform, but also the rapidity with which the student performs. 

The possibility of a covariate measure bringing about a confounding of results may be
exacerbated by implementing techniques from the growing body of literature on the effect on
ANCOVA when the covariate measure is fallible. This is the so-called "fallible covariable"
problem, in which the reliability of the instrument from which the data were collected on the
covariate is less than 1.00. (See, e.g., Carroll, Gallo, & Gleser, 1985; DeGracie & Fuller, 1972;
Lord, 1960; Raaijmakers & Pieters, 1987; Rogers & Hopkins, in press, 1990; and Stroud,
1972.) Ironically, making adjustments to take into account reliability of an inappropriate
covariate measure serves to increase the confounding effect of the covariate.
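
One way to see why a reliability correction can worsen matters is to return to the settings of Example 3. The sketch below assumes that the correction takes the familiar form of dividing the within-class slope by the covariate's reliability, and that the reliability coefficient counts the stable reading component as part of the covariate's true score. The symbols $\rho_{xx}$ and $\beta_w$ are introduced here for illustration and are not taken from the procedures cited above; all components are assumed independent with unit variance.

```latex
\begin{align*}
\rho_{xx} &= \frac{\sigma_t^2 + \sigma_c^2}{\sigma_t^2 + \sigma_c^2 + \sigma_e^2} = \frac{2}{3}, \qquad
\beta_w = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_c^2 + \sigma_e^2} = \frac{1}{3}, \qquad
\frac{\beta_w}{\rho_{xx}} = \frac{1}{2}, \\
\text{uncorrected adjustment} &= \beta_w \, E(\bar{x}_A - \bar{x}_B) = \tfrac{1}{3}(1) = \tfrac{1}{3}, \\
\text{corrected adjustment} &= \frac{\beta_w}{\rho_{xx}} \, E(\bar{x}_A - \bar{x}_B) = \tfrac{1}{2}(1) = \tfrac{1}{2}.
\end{align*}
```

Because the class difference on the covariate arises entirely from reading ability, the whole adjustment is spurious in this setting, and under these assumptions the correction enlarges it from one third to one half of a point.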

Finally, we note that educational researchers often exercise great care in collecting and 
scrutinizing dependent measures, but may fail to maintain the same level of care when dealing 
with covariate measures. This seems to stem from the mistaken belief that appropriate
adjustments will be made whenever groups differ on a covariate that is highly correlated with the
dependent measure. This paper has demonstrated otherwise.

TABLE 1

Characteristics of Data Sets Used In Examples 1-4.

Example  Class  dep               cov                 μ;σ(t)  μ;σ(c)  tr
1        A      t_i + e_1i        t_i + e_2i          1;1     none    0.0
         B      t_i + e_1i        t_i + e_2i          0;1     none    0.0
2        A      t_i + e_1i        t_i + e_2i          1;1     none    0.0
         B      t_i + tr + e_1i   t_i + e_2i          0;1     none    1.0
3        A      t_i + e_1i        t_i + c_i + e_2i    0;1     1;1     0.0
         B      t_i + e_1i        t_i + c_i + e_2i    0;1     0;1     0.0
4        A      t_i + e_1i        t_i + c_i + e_2i    0;1     1;1     0.0
         B      t_i + tr + e_1i   t_i + c_i + e_2i    0;1     0;1     0.5

NOTE: dep = dependent variable, cov = covariate, μ;σ(t) = mean and standard deviation of true
score, μ;σ(c) = mean and standard deviation of component score, tr = treatment effect.

TABLE 2

Least Squares Models Used In Example Analyses.

Designation  Model
(a)          dep = β_0 + β_1 cov
(b)          dep = β_0 + β_1 cov + β_2 grp
(c)          dep = β_0 + β_1 cov + β_2 grp + β_3 int
(d)          dep = β_0 + β_1 grp

NOTE: dep = dependent variable, cov = covariate, grp = group membership, int = product of
covariate and group membership.